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Abstract: In this manuscript, we study the statistical properties of convex clustering. 

We establish that convex clustering is closely related to single linkage hierarchical clus¬ 
tering and Ai-means clustering. In addition, we derive the range of the tuning parameter 
for convex clustering that yields a non-trivial solution. We also provide an unbiased 
estimator of the degrees of freedom, and provide a finite sample bound for the pre¬ 
diction error for convex clustering. We compare convex clustering to some traditional 
clustering methods in simulation studies. 
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1. Introduction 

Let X £ R" xp be a data matrix with n observations and p features. We assume for conve¬ 
nience that the rows of X are unique. The goal of clustering is to partition the n observations 
into K clusters, D \,..., Dk, based on some similarity measure. Traditional clustering meth¬ 
ods such as hierarchical clustering, fc-means clustering, and spectral clustering take a greedy 
approach (see, e.g., Hastie, Tibshirani and Friedman, 2009). 

In recent years, several authors have proposed formulations for convex clustering (Pelck- 
mans et al., 2005; Hocking et al., 2011; Lindsten, Ohlsson and Ljung, 2011; Chi and Lange, 
2014a). Chi and Lange (2014a) proposed efficient algorithms for convex clustering. In addi¬ 
tion, Radchenko and Mukherjee (2014) studied the theoretical properties of a closely related 
problem to convex clustering, and Zhu et al. (2014) studied the condition needed for convex 
clustering to recover the correct clusters. 

Convex clustering of the rows, Xi.,..., X„ , of a data matrix X involves solving the convex 
optimization problem 

1 " 

minimize - V] ||Xj. - U^.||| + AQ q (U), (1) 

UGffi nXp 2 z —' 
i—1 

where Q q (U) = ||Ui. — U.;/.|| g for q £ {1,2, oo}. The penalty Q q (U) generalizes the 

fused lasso penalty proposed in Tibshirani et al. (2005), and encourages the rows of U, the 
solution to (1), to take on a small number of unique values. On the basis of U, we define the 
estimated clusters as follows. 
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Definition 1. The ith and i'th observations are estimated by convex clustering to belong 
to the same cluster if and only if Uj. = Uj/.. 

The tuning parameter A controls the number of unique rows of U, i.e., the number of 
estimated clusters. When A = 0, U = X, and so each observation belongs to its own cluster. 
As A increases, the number of unique rows of U will decrease. For sufficiently large A, all 
rows of U will be identical, and so all observations will be estimated to belong to a single 
cluster. Note that (1) is strictly convex, and therefore the solution U is unique. 

To simplify our analysis of convex clustering, we rewrite (1). Let x = vec(X) £ R rap 
and let u = vec(U) £ R np , where the vec(-) operator is such that £(j_i ) p+ j = and 

U(i_i) p+ j = Uij. Construct D £ R[ p '( 2 )] xnp ; and define the index set C(i,i') such that the 

pxnp submatrix Dg^jq satisfies D^ *qu = Uj. — Uj/.. Furthermore, for a vector b £ R p '( 2 ), 
we define 

P 9 (b)=El|bc ( Mqll 9 - ( 2 ) 

i<i' 

Thus, we have P g (Du) = £j<j, ||D c(M qu|| g = £j <<f ||U, - Uj/JIg = Q,(U). Problem (1) 
can be rewritten as 

minimize-llx — ulln + AP„(Du). (3) 

ugR"P 2 

When q = 1, (3) is an instance of the generalized lasso problem studied in Tibshirani and 
Taylor (2011). Let u be the solution to (3). By Definition 1, the ith and i'th observations 
belong to the same cluster if and only if Dg^qu = 0. In what follows, we work with (3) 
instead of (1) for convenience. 

Let Di £ R” px [ p ( 2 )] be the Moore-Penrose pseudo-inverse of D. We state some properties 
of D and that will prove useful in later sections. 

Lemma 1. The matrices D and have the following properties. 

(i) rank(D) = p{n — 1). 

(li) Dt = ±D r . 

(in) (D T D)tD T = Dt and (DD T )tD = (D T )t. 

(iv) D(D J D)tD 7 = -DD J is a projection matrix onto the column space o/D. 

(v) Define A m j n (D) and A max (D) as the minimum non-zero singular value and maximum 
singular value of the matrix D, respectively. Then, A m i„(D) = A max (D) = yfn. 

In this manuscript, we study the statistical properties of convex clustering. In Section 2, 
we study the dual problem of (3) and use it to establish that convex clustering is closely re¬ 
lated to single linkage hierarchical clustering. In addition, we establish a connection between 
fc-means clustering and convex clustering. In Section 3, we present some properties of convex 
clustering. More specifically, we characterize the range of the tuning parameter A in (3) such 
that convex clustering yields a non-trivial solution. We also provide a finite sample bound 
for the prediction error, and an unbiased estimator of the degrees of freedom for convex clus¬ 
tering. In Section 4, we conduct numerical studies to evaluate the empirical performance of 
convex clustering relative to some existing proposals. We close with a discussion in Section 5. 

2. Convex Clustering, Single Linkage Hierarchical Clustering, and fc-means 
Clustering 

In Section 2.1, we study the dual problem of convex clustering (3). Through its dual problem, 
we establish a connection between convex clustering and single linkage hierarchical clustering 
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in Section 2.2. We then show that convex clustering is closely related to k- means clustering 
in Section 2.3. 


2.1. Dual Problem of Convex Clustering 

We analyze convex clustering (3) by studying its dual problem. Let s,q £ {1,2, oo} satisfy 
f i = 1. For a vector b £ R p '( 2 ), let P*(b) denote the dual norm of P 9 (b), which takes 
the form 

p ,( b ) =max ||b c(i)i ,)|| s . (4) 

We refer the reader to Chapter 6 in Boyd and Vandenberghe (2004) for an overview of the 
concept of duality. 

Lemma 2. The dual problem of convex clustering (3) is 

minimize -||x — D t iz ||2 subject to P*(iz) < A, (5) 

„erHS)] 2 


where v £ is the dual variable. Furthermore, let u and v be the solutions to (3) and 

(5), respectively. Then, 

Du = Dx-DD T i>. ( 6 ) 

While (3) is strictly convex, its dual problem (5) is not strictly convex, since D is not 
of full rank by Lemma l(i). Therefore, the solution i> to (5) is not unique. Lemma l(iv) 
indicates that p DD r is a projection matrix onto the column space of D. Thus, the solution 
Du in ( 6 ) can be interpreted as the difference between Dx, the pairwise difference between 
rows of X, and the projection of a dual variable onto the column space of D. 

We now consider a modification to the convex clustering problem (3). Recall from Defini¬ 
tion 1 that the zth and z’th observations are in the same estimated cluster if De^yjU = 0. 
This motivates us to estimate 7 = Du directly by solving 

minimize ^||Dx - -rill + AP 9 (7)- (?) 

76tr( 2 )l 

We establish a connection between (3) and (7) by studying the dual problem of (7). 
Lemma 3. The dual problem of (7) is 

minimize -||Dx — v'\\\ subject to P*(z/) < A, ( 8 ) 

^erKS)] 2 

where v' £ is the dual variable. Furthermore, let 7 and u be the solutions to (7) and 

(8), respectively. Then, 

7 = Dx — v . (9) 

Comparing ( 6 ) and (9), we see that the solutions to convex clustering (3) and the modified 
problem (7) are closely related. In particular, both Du in ( 6 ) and 7 in (9) involve taking the 
difference between Dx and some function of a dual variable that has PJ(-) norm less than or 
equal to A. The main difference is that in ( 6 ), the dual variable is projected into the column 
space of D. 
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Problem (7) is quite simple, and in fact it amounts to a thresholding operation on Dx 
when q = 1 or q = 2, i.e., the solution 7 is obtained by performing soft thresholding on Dx, 
or group soft thresholding on Dc(i,i')X for all i < i', respectively (Bach et al., 2011). When 
q = oo, an efficient algorithm was proposed by Duchi and Singer (2009). 

2.2. Convex Clustering and Single Linkage Hierarchical Clustering 

In this section, we establish a connection between convex clustering and single linkage hi¬ 
erarchical clustering. Let 7 9 be the solution to (7) with P g (-) norm and let s,q £ {l, 2 ,oo} 
satisfy J J = 1. Since (7) is separable in 'Yc(i.i') f° r * < i't by Lemma 2.1 in Haris, 
Witten and Simon (2015), it can be verified that 

= 0 if and only if H Xi - - X *'-ll« ^ X ( 10 ) 

It might be tempting to conclude that a pair of observations (i, i r ) belong to the same 
cluster if 7 ^ = 0. However, by inspection of (10), it could happen that 7 ^ ^ = 0 and 

= °> but ± °' 

To overcome this problem, we define the n x n adjacency matrix A 9 (A) as 

(1 if i = i', 

[A 9 (A)W = Jl if7! (M 0 =0, ( 11 ) 

1° if T 

Subject to a rearrangement of the rows and columns, A 9 (A) is a block-diagonal matrix with 
some number of blocks, denoted as R. On the basis of A 9 (A), we define R estimated clusters: 
the indices of the observations in the rth cluster are the same as the indices of the observations 
in the rth block of A 9 (A). 

We now present a lemma on the equivalence between single linkage hierarchical clustering 
and the clusters identified by (7) using (11). The lemma follows directly from the definition 
of single linkage clustering (see, for instance, Chapter 3.2 of Jain and Dubes, 1988). 

Lemma 4. Let Ei,..., Er index the blocks within the adjacency matrix A q (X). Let s satisfy 
J J = 1. Let £>i ,..., Dk denote the clusters that result from performing single linkage 
hierarchical clustering on the dissimilarity matrix defined by the pairwise distance between 
the observations ||Xj. — Xj/|| s , and cutting the dendrogram at the height of A > 0. Then 
K — R, and there exists a permutation 7 r : {1,..., K} —► {1,..., A'} such that Dk = E n ^) 
for k = 1,..., K. 

In other words, Lemma 4 implies that single linkage hierarchical clustering and (7) yield 
the same estimated clusters. Recalling the connection between (3) and (7) established in 
Section 2.1, this implies a close connection between convex clustering and single linkage 
hierarchical clustering. 

2.3. Convex Clustering and k-Means Clustering 

We now establish a connection between convex clustering and fc-means clustering, k -means 
clustering seeks to partition the n observations into K clusters by minimizing the within clus¬ 
ter sum of squares. That is, the clusters are given by the partition D 1 ,..., Dk of {1,..., n} 
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that solves the optimization problem 


K 

,.STS... ,u K E E » x .. - Ml 

k=l i£Dk 


We consider convex clustering (1) with q = 0, 

1 n 


minimize 

u eR nx p 2 


( 12 ) 


(13) 


where I(U,. ^ Uj'.) is an indicator function that equals one if Uj. ^ Uj'.. Note that (13) is 
no longer a convex optimization problem. 

We now establish a connection between (12) and (13). For a given value of A, (13) is 
equivalent to 


minimize 

UeK" Xr ,K , k i 1 .. ,n K eRP ,E 1 ,... ,E K 


\ E E Iix,. 

k—1 i£Ek 


K 

MfeII2 + ^ e E k , i' E k ), 

i<i' k—1 


(14) 


subject to the constraint that ..., n K } are the unique rows of U and E k = {i : Uj. = 
fi k }. Note that I(i £ E k ,i' ^ E k ) is an indicator function that equals to one if i £ E k and 
i' (ji E k . Thus, we see from (12) and (14) that fc-means clustering is equivalent to convex 
clustering with q = 0, up to a penalty term A X)i<j' Efc=i £ Ek, i' ^ E k ). 

To interpret the penalty term, we consider the case when there are two clusters E\ and 
E 2 . The penalty term reduces to A|£i| • (n— |£i|), where \E\ \ is the cardinality of the set E\. 
The term A|£d | • (n— \ E\ |) is minimized when \E\ \ is either lorn-1, encouraging one cluster 
taking only one observation. Thus, compared to fc-means clustering, convex clustering with 
q = 0 has the undesirable behavior of producing clusters whose sizes are highly unbalanced. 


3. Properties of Convex Clustering 

We now study the properties of convex clustering (3) with q £ {1,2}. In Section 3.1, we 
establish the range of the tuning parameter A in (3) such that convex clustering yields a 
non-trivial solution with more than one cluster. We provide finite sample bounds for the 
prediction error of convex clustering in Section 3.2. Finally, we provide unbiased estimates 
of the degrees of freedom for convex clustering in Section 3.3. 


3.1. Range of X that Yields Non-trivial Solution 


In this section, we establish the range of the tuning parameter A such that convex clustering 
(3) yields a solution with more than one cluster. 

Lemma 5. Let 


for q = 1, 


mm < max 


? {||(F Dx +( I -FDD T )u;) c( . i)) || 2 }) for q = 2. 


(15) 


Convex clustering (3) with q = 1 or q = 2 yields a non-trivial solution of more than one 
cluster if and only if A < A upp er ■ 
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By Lemma 5, we see that calculating A upp er boils down to solving a convex optimization prob¬ 
lem. This can be solved using a standard solver such as CVX in MATLAB. In the absence of such 
a solver, a loose upper bound on A upper is given by || A-Dx)^ for q = 1 , or max || A-Dc(i,i')x ||2 

for q = 2 . 

Therefore, to obtain the entire solution path of convex clustering, we need only consider 
values of A that satisfy A < A upper - 


3.2. Bounds on Prediction Error 


In this section, we assume the model x = u+e, where e £ is a vector of independent sub- 
Gaussian noise terms with mean zero and variance tr 2 , and u is an arbitrary np-dimensional 
mean vector. We refer the reader to pages 24-25 in Boucheron, Lugosi and Massart (2013) 
for the properties of sub-Gaussian random variables. We now provide finite sample bounds 
for the prediction error of convex clustering (3). Let A be the tuning parameter in (3) and 
let A' = 

np 

Lemma 6. Suppose that x = u+e, where e £ R" p and the elements of e are independent sub- 
Gaussian random variables with mean zero and variance cr 2 . Let u be the estimate obtained 

from (3) with q= 1. If A' > , then 


2 np 


u- u 


\l < ^-||Dii||i 


+ 


n 


I log (np) 
n 2 p 


holds with probability at least 1 — ^ 2 n ^ — exp min ^ci log (np), c-i y / plog(np)^ j, where c\ 
and C 2 are positive constants appearing in Lemma 10. 

We see from Lemma 6 that the average prediction error is bounded by the oracle quantity 
||Du|| i and a second term that decays to zero as n,p —» oo. Convex clustering with q = 
1 is prediction consistent only if A' II Dull i = o(l). We now provide a scenario for which 
A'llDulli =o(l) holds. 

Suppose that we are in the high-dimensional setting in which p > n and the true underlying 
clusters differ only with respect to a fixed number of features (Witten and Tibshirani, 2010). 
Also, suppose that each element of Du - that is, Uij — U,/j — is of order 0(1). Therefore, 
||Du|| i = 0{n 2 ), since by assumption only a fixed number of features have different means 


across clusters. Assume that 


ralog(p-(”)) 


= o(l). Under these assumptions, convex clustering 


with q = 1 is prediction consistent. 

Next, we present a finite sample bound on the prediction error for convex clustering with 

q = 2 . 

Lemma 7. Suppose that x = u+e, where e £ R np and the elements of e are independent sub- 
Gaussian random variables with mean zero and variance a 2 . Let u be the estimate obtained 


from (3) with q = 2. If A' > 4g^/ lc ’ ^ > 3 ^ 2 ^ , then 


^-!|u - u||! I! d c(+') u I |2 +« 

^ i<V 



log(np) 


n 2 p 
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holds with probability at least 1 - 

P\2 

and C 2 are positive constants appearing in Lemma 10. 

Under the scenario described above, ||Dc(i,i') u ll 2 = 0(1), and therefore ||Dc( i)i /)u || 2 = 


— exp 


min ^Ci log(np), c 2 \/plog(np)j j, where c\ 


0(n 2 ). Convex clustering with q 


2 is prediction consistent if 



o(l). 


3.3. Degrees of Freedom 


Convex clustering recasts the clustering problem as a penalized regression problem, for which 
the notion of degrees of freedom is established (Efron, 1986). Under this framework, we 
provide an unbiased estimator of the degrees of freedom for clustering. Recall that u is the 
solution to convex clustering (3). Suppose that Var(x) = o 2 I. Then, the degrees of freedom 
for convex clustering is defined as ^ Co(see, e.g., Efron, 1986). An unbiased 
estimator of the degrees of freedom for convex clustering with q = 1 follows directly from 
Theorem 3 in Tibshirani and Taylor (2012). 

Lemma 8. Assume that x ~ MVN(u, <r 2 I), and let u be the solution to (3) with q = 1. 
Furthermore, let B\ = {j : (Du)j 0}. We define the matrix D_^ by removing the rows of 
D that correspond to B±. Then 

dfr =tr(l-D^ i (D_g (16) 

is an unbiased estimator of the degrees of freedom of convex clustering with q = 1 . 

The following corollary follows directly from Corollary 1 in Tibshirani and Taylor (2011). 

Corollary 1. Assume that x ~ MVN(u, er 2 I), and let u be the solution to (3) with q = 1. 
The fit u has degrees of freedom 

dfi(u) = E [number of unique elements in u]. 


There is an interesting interpretation of the degrees of freedom estimator for convex clus¬ 
tering with q = 1. Suppose that there are K estimated clusters, and all elements of the 
estimated means corresponding to the K estimated clusters are unique. Then the degrees of 
freedom is Kp, the product of the number of estimated clusters and the number of features. 

Next, we provide an unbiased estimator of the degrees of freedom for convex clustering 
with q = 2 . 

Lemma 9. Assume that x ~ MVN(u, er 2 I), and let u be the solution to (3) with q = 2. 
Furthermore, let £> 2 = {(*,*') : ||Dc(.jy)u || 2 0}. We define the matrix D_g, by removing 

rows o/D that correspond to Z? 2 . Let P = ^1 — D 2 ^ (D_g 2 D T ^ )^D _g a J be the projection 
matrix onto the complement of the space spanned by the rows of D_g 2 . Then 


df 2 = tr 


I + AP £ 

(M')e ® 2 


/ D C(M') D C(M') 
\ ll D C(i,i')U||2 




|D c(m0 u||3 



is an unbiased estimator of the degrees of freedom of convex clustering with q = 2 . 


( 17 ) 
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When A = 0, |jDc( iji -)u|| 2 ^ 0 for all i < %'. Therefore, P = I £ R" pxnp and the degrees of 
freedom estimate is equal to tr(I) = np. When A is sufficiently large that is an empty set, 
one can verify that P = I — D T (DD r )^ D is a projection matrix of rank p, using the fact 
that rank(D) = p(n — 1) from Lemma l(i). Therefore df 2 = tr(P) = p. 

We now assess the accuracy of the proposed unbiased estimators of the degrees of freedom. 
We simulate Gaussian clusters with K = 2 as described in Section 4.1 with n = p = 20 and 
a = 0.5. We perform convex clustering with q = 1 and q = 2 across a fine grid of tuning 
parameters A. For each A, we compare the quantities (16) and (17) to 

1 rap 

^2 - u M x i ~ ( 18 ) 

3 =1 

which is an unbiased estimator of the true degrees of freedom, Xq=i Co v(v,j,Xj), averaged 
over 500 data sets. In addition, we plot the point-wise intervals of the estimated degrees of 
freedom (mean ±2 x standard deviation). Note that (18) cannot be computed in practice, 
since it requires knowledge of the unknown quantity u. Results are displayed in Figure 1. We 
see that the estimated degrees of freedom are quite close to the true degrees of freedom. 

(a) q=1 (b) q=2 




True Degrees of Freedom True Degrees of Freedom 


Fig 1. We compare the true degrees of freedom of convex clustering (x-axis), given in (18), to the proposed 
unbiased estimators of the degrees of freedom (y-axis), given in Lemmas 8 and 9. Panels (a) and (b) contain 
the results for convex clustering with q = 1 and q = 2, respectively. The red line is the mean of the estimated 
degrees of freedom for convex clustering over 500 data sets, obtained by varying the tuning parameter A. The 
shaded bands indicate the point-wise intervals of the estimated degrees of freedom (mean d z 2 x standard 
deviation), over 500 data sets. The black line indicates y = x. 


4. Simulation Studies 

We compare convex clustering with q — 1 and q = 2 to the following proposals: 

1. Single linkage hierarchical clustering with the dissimilarity matrix defined by the Eu¬ 
clidean distance between two observations. 
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2. The /c-means clustering algorithm (Lloyd, 1982). 

3. Average linkage hierarchical clustering with the dissimilarity matrix defined by the 
Euclidean distance between two observations. 

We implement convex clustering (3) with q = {1, 2} using the R package cvxclustr (Chi 
and Lange, 2014b). In order to obtain the entire solution path for convex clustering, we use 
a fine grid of A values for (3), in a range guided by Lemma 5. We apply the other methods 
by allowing the number of clusters to vary over a range from 1 to n clusters. To evaluate 
and quantify the performance of the different clustering methods, we use the Rand index 
(Rand, 1971). A high value of the Rand index indicates good agreement between the true 
and estimated clusters. 

We consider two different types of clusters in our simulation studies: Gaussian clusters 
and non-convex clusters. 

4-1. Gaussian Clusters 

We generate Gaussian clusters with K = 2 and K = 3 by randomly assigning each observa¬ 
tion to a cluster with equal probability. For K = 2, we create the mean vectors fi 1 = l p and 
fi 2 = — lp- For K = 3, we create the mean vectors = —3 • l p , pu 2 = 0 p , and fi 3 = 3 • l p . 
We then generate the nxp data matrix X according to X, ~ MVN(^t fe , cr 2 I) for i G D k . We 
consider n = p = 30 and a = {1,2}. The Rand indices for K = 2 and K = 3, averaged over 
200 data sets, are summarized in Figures 2 and 3, respectively. 

Recall from Section 2.2 that there is a connection between convex clustering and single 
linkage clustering. However, we note that the two clustering methods are not equivalent. From 
Figure 2(a), we see that single linkage hierarchical clustering performs very similarly to convex 
clustering with q = 2 when the signal-to-noise ratio is high. However, from Figure 2(b), we 
see that single linkage hierarchical clustering outperforms convex clustering with q = 2 when 
the signal-to-noise ratio is low. 

We also established a connection between convex clustering and /c-means clustering in 
Section 2.3. From Figure 2(a), we see that /c-means clustering and convex clustering with 
q = 2 perform similarly when two clusters are estimated and the signal-to-noise ratio is high, 
since in this case the penalty term dominates the first term in (14). In contrast, when the 
signal-to-noise ratio is low, the first term dominates the penalty term in (14). Therefore, 
when convex clustering with q = 2 estimates two clusters, one cluster is of size one and the 
other is of size n — 1, as discussed in Section 2.3. Figure 2(b) illustrates this phenomenon 
when both methods estimate two clusters: convex clustering with q — 2 has a Rand index of 
approximately 0.5 while /c-means clustering has a Rand index of one. 

All methods outperform convex clustering with q = 1. Moreover, k -means clustering and 
average linkage hierarchical clustering outperform single linkage hierarchical clustering and 
convex clustering when the signal-to-noise ratio is low. This suggests that the minimum 
signal needed for convex clustering to identify the correct clusters may be larger than that 
of average linkage hierarchical clustering and /c-means clustering. We see similar results for 
the case when K = 3 in Figure 3. 


4-2. Non-Convex Clusters 

We consider two types of non-convex clusters: two circles clusters (Ng, Jordan and Weiss, 
2002) and two half-moon clusters (Hocking et al., 2011; Chi and Lange, 2014a). For two 
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(a) Gaussian: K = 2, o = 1 



(b) Gaussian: K = 2, o = 2 



Fig 2. Simulation results for Gaussian clusters with K = 2, n = p = 30, averaged over 200 data sets, for 
two noise levels a = {1,2}. Colored lines correspond to single linkage hierarchical clustering (•** ), average 
linkage hierarchical clustering ( ), k-means clustering convex clustering with q — 1 ( ), and 

convex clustering with q = 2 


(a) Gaussian: K = 3, o = 1 


(b) Gaussian: K = 3, o = 2 




Fig 3. Simulation results for Gaussian clusters with K = 3, n = p = 30, averaged over 200 data sets, for 
two noise levels a = {1,2}. Line types are as described in Figure 2. 
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circles clusters, we generate 50 data points from each of the two circles that are centered at 
(0, 0) with radiuses two and 10, respectively. We then add Gaussian random noise with mean 
zero and standard deviation 0.1 to each data point. For two half-moon clusters, we generate 
50 data points from each of the two half-circles that are centered at (0, 0) and (30, 3) with 
radius 30, respectively. We then add Gaussian random noise with mean zero and standard 
deviation one to each data point. Illustrations of both types of clusters are given in Figure 4. 
The Rand indices for both types of clusters, averaged over 200 data sets, are summarized in 
Figure 5. 

We see from Figure 5 that convex clustering with q = 2 and single linkage hierarchical 
clustering have similar performance, and that they outperform all of the other methods. 
Single linkage hierarchical clustering is able to identify non-convex clusters since it is an 
agglomerative algorithm that merges the closest pair of observations not yet belonging to 
the same cluster into one cluster. In contrast, average linkage hierarchical clustering and k- 
means clustering are known to perform poorly on identifying non-convex clusters (Ng, Jordan 
and Weiss, 2002; Hocking et ah, 2011). Again, convex clustering with q = 1 has the worst 
performance. 


(a) Two Circles 

(b) Two Half-Moons 

.. *• -n 

•* •• 



M • 

• % 


* 

x O •• 
• * 


1 * 

V. J 

• / 


\ . y 

•. «• 




Fig 4. Illustrations of two circles clusters and two half-moons clusters with n = 100. 


(a) Two Circles (b) Two Half-Moons 



Fig 5. Simulation results for the two circles and two half-moons clusters with n = 100, averaged over 200 
data sets. Line types are as described in Figure 2. 
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4-3. Selection of the Tuning Parameter A 

Convex clustering (3) involves a tuning parameter A, which determines the estimated number 
of clusters. Some authors have suggested a hold-out validation approach to select tuning 
parameters for clustering problems (see, for instance, Tan and Witten, 2014; Chi, Allen and 
Baraniuk, 2014). In this section, we present an alternative approach for selecting A using the 
unbiased estimators of the degrees of freedom derived in Section 3.3. 

The Bayesian Information Criterion (BIC) developed in Schwarz (1978) has been used 
extensively for model selection. However, it is known that the BIC does not perform well 
unless the number of observations is far larger than the number of parameters (Chen and 
Chen, 2008, 2012). For convex clustering (3), the number of observations is equal to the 
number of parameters. Thus, we consider the extended BIC (Chen and Chen, 2008, 2012), 
defined as 

eBIC g , 7 = np ■ log + df g • log (np) + 2y • df 9 • log(rap), (19) 

where RSS g = ||x — u g |||, is the convex clustering estimate for a given value of q and A, 
7 £ [0,1], and df 9 is given in Section 3.3. Note that we suppress the dependence of u 9 and 
df 9 on A for notational convenience. We see that when 7 = 0, the extended BIC reduces to 
the classical BIC. 

To evaluate the performance of the extended BIC in selecting the number of clusters, 
we generate Gaussian clusters with K = 2 and K = 3 as described in Section 4.1, with 
n = p = 20, and a = 0.5. We perform convex clustering with q = 2 over a fine grid 
of A, and select the value of A for which the quantity eBIC 9j7 is minimized. We consider 
7 £ {0, 0.5,0.75,1}. Table 1 reports the proportion of datasets for which the correct number 
of clusters was identified, as well as the average Rand index. 

From Table 1, we see that the extended BIC is able to select the true number of clusters 
accurately for K — 2. When K = 3, the classical BIC (7 = 0) fails to select the true number 
of clusters. In contrast, the extended BIC with 7 = 1 has the best performance. 

Table 1 

Simulation study to evaluate the performance of the extended BIC for tuning parameter selection for 
convex clustering with q = 2. Results are reported over 100 simulated data sets. We report the proportion 
of data sets for which the correct number of clusters was identified, and the average Rand index. 



eBIC2, 7 

Correct number of clusters 

Rand index 

Gaussian clusters, K = 2 

7 = 0 

0.94 

0.9896 


7 = 0.5 

0.98 

0.9991 


7 = 0.75 

0.99 

0.9995 


7 = 1 

0.99 

0.9995 

Gaussian clusters, K = 3 

7 = 0 

0.06 

0.7616 


7 = 0.5 

0.59 

0.9681 


7 = 0.75 

0.70 

0.9768 


7 = 1 

0.84 

0.9873 


5. Discussion 

Convex clustering recasts the clustering problem into a penalized regression problem. By 
studying its dual problem, we show that there is a connection between convex clustering and 
single linkage hierarchical clustering. In addition, we establish a connection between convex 
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clustering and fc-means clustering. We also establish several statistical properties of convex 
clustering. Through some numerical studies, we illustrate that the performance of convex 
clustering may not be appealing relative to traditional clustering methods, especially when 
the signal-to-noise ratio is low. 

Many authors have proposed a modihcation to the convex clustering problem (1), 

n 

minimize - V ||X 4 . - + AQ,(W,U), (20) 

UGt nXp Z Z —' 
i—1 

where W is an nxn symmetric matrix of positive weights, and Q q (W, U) = W 7 ,;,:/ ||U t . — 

Uj/JIg (Pelckmans et al., 2005; Hocking et al., 2011; Lindsten, Ohlsson and Ljung, 2011; Chi 
and Lange, 2014a). For instance, the weights can be defined as W,^ = exp (—<)>||X,;. — X.j/jH) 
for some constant <j> > 0. This yields better empirical performance than (1) (Hocking et al., 
2011; Chi and Lange, 2014a). We leave an investigation of the properties of (20) to future 
work. 
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Appendix A: Proof of Lemmas 2—3 


Proof of Lemma 2: 

Proof. We rewrite problem (3) as 


1 


||x — u 11 2 + AP g (r 7 1 ) subject to rj 1 = Du, 


minimize 


ueR T *r\?7 1 eR 



with the Lagrangian function 



(A-l) 


where u £ IR^'W] is the Lagrangian dual variable. In order to derive the dual problem, we 


need to minimize the Lagrangian function over the primal variables u and r] 1 . Recall from 
(4) that P*(-) is the dual norm of P g (-). It can be shown that 





otherwise. 


minimize -||x — D T u\\l subject to P*(iz) < A. 


(A-2) 



We now establish an explicit relationship between the solution to convex clustering and 
its dual problem. Differentiating the Lagrangian function (A-l) with respect to u and setting 
it equal to zero, we obtain 

u = x — D 7 £>, 

where i> is the solution to the dual problem, which satisfies P g (£0 < A by (A-2). Multiplying 
both sides by D, we obtain the relationship ( 6 ). □ 

Proof of Lemma 3: 

Proof. We rewrite (7) as 


minimize -||Dx- 7 ||2 + AP 0 (r,,) 

TeR Hs)U eR Hs)] 2 


subject to r ] 2 = 7 , 


with the Lagrangian function 


£(7, V 2 : ”') = ^ll Dx - Till + AP g (T7 2 ) + {y') T {l ~ rj 2 ), 


(A-3) 
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where v' £ is the Lagrangian dual variable. In order to derive the dual problem, we 

minimize the Lagrangian function over the primal variables 7 and r\ 2 . It can be shown that 


ri 2 e 


inf 

dIW 


£(7. V2,”') = 


2 II 
—00 


II2 "I" ) I 


otherwise, 


and 


inf 


—00 


Therefore, the dual problem for (7) is 

minimize — 1 |Dx — 1 || subject to P>') < A- (A-4) 

w / 6 bH2)] 2 


We now establish an explicit relationship between the solution to (7) and its dual problem. 
Differentiating the Lagrangian function (A-3) with respect to 7 and setting it equal to zero, 
we obtain 

7 = Dx — v , 

where v is the solution to the dual problem, which we know from (A-4) satisfies P*(i> ) < 
A. □ 


Appendix B: Proof of Lemma 5 
Proof of Lemma 5: 

Proof. Since D is not of full rank by Lemma 1 (i) , the solution to (5) in the absence of 
constraint is not unique, and takes the form 

v = (DD T ) t Dx + (I - D(D T D) t D T )w 
= (D T )^x + (I — DD^)w ( B -l) 

= -Dx+ (I - -DD t )u, 

n n 

for uj £ kI p '( 2 )]. The second equality follows from Lemma l(iii) and the last equality follows 
from Lemma 1 (ii) . 

Let u be the solution to (3). Substituting v given in (B-l) into ( 6 ), we obtain 

Du = Dx — DD t i> 

= Dx - —DD r Dx - DD t u + -DD t DD t w 

n n 

= Dx — Dx DD t ll> + DD t w 

= 0 . 

Recall from Definition 1 that all observations are estimated to belong to the same cluster if 
Du = 0. For any u in (B-l), picking A = P*(i>) guarantees that the constraint on the dual 
problem (5) is inactive, and therefore that convex clustering has a trivial solution of Du = 0. 
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Since u is not unique, P*(iz) is not unique. In order to obtain the smallest tuning parameter 
A such that Du = 0, we take 

Aupper := min P* ( -Dx + ( I — -DD 1 ) u 
wg R['-(S)] V- v n ) 

Any tuning parameter A > A upp er results in an estimate for which all observations belong to 
a single cluster. The proof is completed by recalling the definition of the dual norm P*(-) in 
(4). □ 


Appendix C: Proof of Lemmas 6—7 


To prove Lemmas 6 and 7, we need a lemma on the tail bound for quadratic forms of 
independent sub-Gaussian random variables. 

Lemma 10. (Hanson and Wright, 1971) Let z be a vector of independent sub-Gaussian 
random variables with mean zero and variance a 2 . Let M be a symmetric matrix. Then, 
there exists some constants ci, C 2 > 0 such that for any t > 0, 

Pr (zTMz £ 1 + ,Ar(M) > £ exp {- min (iJfkc } • 

where || • ||f and || • || sp are the Frobenius norm and spectral norm, respectively. 

In order to simplify our analysis, we start by reformulating (3) as in Liu, Yuan and Ye 
(2013). Let D = AAVj be the singular value decomposition of D, where A G 
a G ]RP(n-i)xp(n-i) ) and V/3 e R npxp(n-1) Construct V a G R npxp such that V = [V„, V^] G 
K npxnp is an orthogonal matrix, that is, V T V = VV T = I. Note that = 0. 

Let (3 = Vju G and a = V^u G R p . Also, let A' = ^. Optimization problem 

(3) then becomes 

minimize 7T^II X — V Q a — V p(3\\ 2 + X l P q (Z(3), (C-l) 

aeRp./aeKpf"- 1 ) 2 np 


where Z = AA G 1 ). Note that rank(Z) = p(n — 1) and therefore, there exists 

a pseudo-inverse Z^ G K p O _1 l x fc’G)] such that Z^Z = I. Recall from Section 1 that the set 
C(i,i') contains the row indices of D such that = LR. — U p,- Let the submatrices 

Z C(M0 and Z^ ^ denote the rows of Z and the columns of Zf, respectively, corresponding 
to the indices in the set C(i,i'). By Lemma l(v), 


A min (Z) = A min (D) = 


1 


A max (Zt) 


— y/n and A max (Z) — A max (D) — 


A m i n (Zt 


Let a and (3 denote the solution to (C-l). 


\fn. 

(C-2) 


Proof of Lemma 6: 

Proof. We establish a finite sample bound for the prediction error of convex clustering with 
q = 1 by analyzing (C-l). First, note that u = V Q o; + Vp/3 and u = V a a + V^/3. Thus, 
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5 ^||u - u|| 2 = ^||V Q (a - a) + V P 0 - (3)\\ 2 . Recall from (2) that Pi(Z/3) = \\Zf3\h- By 
the definition of a. and 0, we have 


^llx - (V Q a + W p 0)f + A'||Z/3||! < ^_||x - (V Q c* + Yp0)f + \'\\Z0\\ U 


implying 


-^||V a (a - a) + Vp0 - 0)\\ 2 + A'llZ^Hi < -lG(dJ) + A , ||Z/3|| 1 , (C-3) 

Znp np 


where G(d, 0) = e T V a (d - a ) + Vp(0-0) . Recall that V^V a = I and = 0. By 

the optimality condition of (C-l), 

d = V^(x-V^) 

= Vl (v a cx + \pP + e-VpP) 

= a + V^e. 

Therefore, substituting a — a. = V^e into G(a,0), we obtain 


— |G(a,0)1 = — \e T fv a (d - a) + W p (0 - 0)\ I 

np I I np I L JI 


np 

= — I e T V Q V^e + e T V /3 (0 - /3) I 
np I I 

< — e T V a Vle+ — |e T V^(^-/3)j 

np np I I 

= —e T V a Vle + — |e T V /3 Z t Z0 - /3)j 
np np I I 

< —e T V a Vle+ —||e T V^Z t || 00 ||Z(R-/3)||i. 

np np 


We now establish bounds for ^e T V a Vle and — ||e T V^Zi|| cx) that hold with high proba¬ 
bility. 

Bound for —e T V Q Vle: 

np lx 

First, note that V Q Vl is a projection matrix of rank p, and therefore ||V a Vl||sp = 1 and 
||V q VI|| f = p. By Lemma 10 and taking z = e and M = V Q Vl, we have that 

Pr (e T V a Vle > t + a 2 p ) < exp j-min ^ 

where c\ and C 2 are constants in Lemma 10. Picking t = a 2 y/ plog(np), we have 

7 1 


Pr | —e T V a Vle > - 2 


np 


1 / log (np) 
n V n2 P 


< exp | - min (ci log (np), c 2 \J p log (np)) j . 


Bound for — I |e T V«Z^ 


(C-4) 


0A’ lloo- 


Let ej be a vector of length p • (™) with a one in the jth entry and zeroes in the remaining 
entries. Let Vj = ej (Zt) T Vj e. Using the fact that A max (V^) = 1 and A max (Zi) = (C-2), 
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we know that each Vj is a sub-Gaussian random variable with zero mean and variance at 

2 

most —. Therefore, by the union bound, 


Pr ^max \vj\ > z'j < p ■ ■ Pr (\vj\ > z) < 2p ■ exp 


nz 

' 2^2 


Picking ^ = 2.v/W, 


we obtain 


Pr 



> 2(7 


log(p-(S)) 



(C-5) 


Combining the two upper bounds: Setting X' > 4 cry— ^ 3 p 2 2 ^ and combining the results 
from (C-4) and (C-5), we obtain 


— G(a,j3) < a 2 
np 


llog(np) 

n 2 p 


+ ^\\Z0-^\\i 


(C-6) 


with probability at least 1 — (TpT ~ exp | —min ^ci log(np) , c 2 -\/p log(np)^ j. Substituting 


(C-6) into (C-3), we obtain 


— ||V a (d - a) + Vf,0 - (3)\\ 2 + A'Mi 


/ Z 

< a 


1 

-b 

n 


log (np) 


+ ^||Z0-/3)|| 1 + A'||Z/3|| 1 . 


We get Lemma 6 by an application of the triangle inequality and by rearranging the terms. 

□ 


Proof of Lemma 7: 

Proof. We establish a finite sample bound for the prediction error of convex clustering with 
q = 2 by analyzing (C-l). Recall from (2) that P 2 (Z/3) = l|Zc(i,i')/3||2- By the definition 

of a and /3, we have 

^||x-(V Q d+V^)|| 2 +A' £ \\Z c(iy) Ph < ^H|x-(V Q a+V^/3)|| 2 +A' £ \\Z c{iy) (3\\ 2 , 

^ i<i’ ^ i<i' 


implying 
1 


1 


2n ll V Cf (a-a)+V^0-/3)|| 2 + A / ^||Z c(M q^|| 2 < — G(a J)+A'^ ||Z c(m0 /3|| 2 , (C-7) 

” i<i' ” i<i' 


where G(d, 0 ) = e T V a (a: — a) + V^(/3 — (3) . Again, by the optimality condition of (C-l), 
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we have that a. — a = V^e. Substituting this into ^G(d,/3), we obtain 
— |G(d,j3)| = — \e T [v a (a-a) + V fl 03-/3)l| 

np I I np I L JI 


np 

1 

np 


r V a Vle + e T Vp0-P)\ 


< — e T V Q Vle+ — |e T V^(/9 — /3)| 

np np I I 

= — e T V a Vle + — |e T V^Z t Z03 - /3)| 

np np I I 


1 T -\t -* tT 1 

— —e V a V a e H- 

np np 


< 


E( eTv ^ Z c (i , i '))( Z cb, i ')(^-/3)) 

i<i' 

i-6 T V Q Vle + -L J2i( e T V fJ Zj, (i . r) )(Z C(iy) 0 - (3)) 


np 

< —e T 

np 


np 


i<i' 


V Q e + — £ ||6 T V / 3Zt ( . . /) || 2 ||Z c{i , i , ) ( ( 3 - S)h 


np 

X<1 


< ^ T V°y T a e + T ■ max \\e T Vf,Z' c || 2 £ ||Z c(i , o 0 - /3)|| 2 , 

IbU I b/J 2%2 


i<i' 


where the second inequality follows from an application of the triangle inequality and the 
third inequality from an application of the Cauchy-Schwarz inequality. We now establish 
bounds for -i-e T V Q Vje and • max ||e T V ) gZL, .,J| 2 that hold with large probability. 

Bound for — e T 'V a V^e: 

This is established in the proof of Lemma 6 in (C-4), i.e., 

1 rp 11 1 


Pr 


np 


e Vo-V^e > a 


+ 


/log(np) 

n 2 p 


< 


np 


Bound for — • max ||e T V,gZL. .,J| 2 : 

np i<v II P C(i,l')U z 

First, note that there are p indices in each set Therefore, 

Note that 

— • max ||e T V i aZ^ ( i0 || 2 < 

Therefore, using (C-8), 


~ 2 ~ ■ max ||e T V^zL i n 

n z p i<i' ^ ’ > 


= \/^-|I^Z t IU. (C-8) 


n*p 


Prl ^T<“ l|eTv<,z fo<') |l2 - 2< ’' 


M pi©) 

n 3 p 


<Pr Ik^ZtlU >2a\ - 


!°g(p- ( 2 )) 


(C-9) 


< 
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where the last inequality follows from (C-5) in the proof of Lemma 6. 

Therefore, for A' > 4cr \/ los ^f 3 ^ 2 ^ , we have 4- < -f- -max ||e T V«zL. J| 2 with probability 

V n P z n P ^ )'< 

at most Combining the results from (C-4) and (C-9), we have that 


— G(a,/3) < a 2 
np 


I log {np) 

n 2 p 


A' 


+ ^El|Zc ( M' ) 0-/3)l| 2 (C-10) 


i<i' 


with probability at least 1 — ~ exp | —min ^ci log(np) , c 2 -\/p log(np)^ |. Substituting 


(C-10) into (C-7) , we obtain 


— ||V Q (a - a) + Vf ,&- /3)|| 2 + A' E ||Z c(M0i a|| 2 


/ Z 

< a 


+ 


log {np) 

n 2 p 


A' 


+ tE ll z c(i,i')(^ - 0)h + E II Z C(M')/3||2- 


We get Lemma 7 by an application of the triangle inequality and by rearranging the terms. 

□ 


Appendix D: Proof of Lemma 9 


Proof of Lemma 9: 


Proof. Directly from the dual problem (5), D T z> is the projection of x onto the convex set 
K = {D 7 v : P^i') < A}. Using the primal-dual relationship u = x — D 7 v , we see that u is 
the residual from projecting x onto the convex set K. By Lemma 1 of Tibshirani and Taylor 
(2012), u is continuous and almost differentiable with respect to x. Therefore, by Stein’s 
formula, the degrees of freedom can be characterized as E [tr (§^)] ■ 

Recall that Ddenotes the rows of D corresponding to the indices in the set C{i,i'). 
Let Z ?2 = {(*, i r ) ■ ||Dc(»,»')'*ll 2 0}. By the optimality condition of (3) with q = 2, we obtain 

(x - u) = A E D C(M')SC(M'), (D- 1 ) 

i<i' 


where 




ll D C(«,C) <i ll 2 

e {r : ||T || 2 < 1} 


if (M')eB 2 . 
If (i,i')$B 2 . 


We define the matrix D_g 2 by removing the rows of D that correspond to elements in 1? 2 . 
LetP=(l-D^(D_j D^)tD_ k e the projection matrix onto the complement of 
the space spanned by the rows of D_^ . 

By the definition of D_g 2 , we obtain D_g 2 U = 0. Therefore, Pu = u. Multiplying P onto 
both sides of (D-l), we obtain 


Px - u = AP E D I(i, i')9c(i,i') 

i<i' 


= AP E 

(M')6B 2 


D c(«,<o D c«,iQ" 

l|Dc(i,»')U||2 


(D-2) 
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where the second equality follows from the fact that PD^ ^ = 0 for any ( i , i') ^ £> 2 . 

Vaiter et al. (2014) showed that there exists a neighborhood around almost every x such 
that the set B 2 is locally constant with respect to x. Therefore, the derivative of (D-2) with 
respect to x is 


< 9 u _ N p f ^C(i,i')^C(i,i') 

^ { ||D c(iy) u|| 2 


D 




,uu t D 


C(v 


iD, 




I ID, 


C(i,i 0 1 


S' < D - 3 » 


d A 1 Av _ A A A'Avv J A J A 


using the fact that for any matrix A with ||Av|| 2 ^ 0, gv ||Av||a - ||Av||2 ||Av 

Solving (D-3) for we have 


<9u 

dx 


I + AP £ 






|D C(M0 U|| 2 ||D c(iy) u||l 


P. 

(D-4) 


Therefore, an unbiased estimator of the degrees of freedom is of the form 

( 


tr 


9u 

dx 


= tr 


1 -1 


I + AP X 




ll D C(i,i')U || 2 


l D C(i,i')U||| 


□ 
















